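Most of our variables of interest were nearly complete. The table below gives the percentage of missing values for each variable; only belongs_to_collection, homepage, and tagline are missing for more than half of the movies.
| Variable | Missing (%) |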
|---|---|
| id | 1.4061881 |
| adult | 1.4061881 |
| belongs_to_collection | 88.8165134 |
| budget | 1.4061881 |
| genres | 0.0000000 |
| homepage | 81.6139254 |
| imdb_id | 0.0374103 |
| original_language | 0.0242067 |
| original_title | 0.0000000 |
| overview | 2.0773734 |
| popularity | 1.4127899 |
| poster_path | 0.8208265 |
| production_companies | 0.0066018 |
| production_countries | 0.0066018 |
| release_date | 1.5932397 |
| revenue | 1.4127899 |
| runtime | 1.9717442 |
| spoken_languages | 1.4127899 |
| status | 1.5866379 |
| tagline | 55.7083755 |
| title | 1.4127899 |
| video | 1.4127899 |
| vote_average | 1.4127899 |
| vote_count | 1.4127899 |
| avg_rating | 0.0000000 |
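
A minimal sketch of how a percent-missing table like the one above can be computed in base R. The data frame `movies` and its values are made up for illustration, and the report may have treated empty strings as missing as well, which this sketch does not do.

```r
# Toy data frame standing in for the movie metadata (values are made up).
movies <- data.frame(
  title    = c("A", "B", NA, "D"),
  homepage = c(NA, NA, NA, "http://example.com"),
  runtime  = c(90, 101, 120, NA),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colMeans() converts it to the share of
# missing values per column, which is then scaled to a percentage.
pct_missing <- colMeans(is.na(movies)) * 100
print(pct_missing)
```

Sorting `pct_missing` in decreasing order makes the sparsest variables easy to spot.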
We have data on movies from 1874 to 2020. The number of movies per year clearly increases over time.
| Statistic | Value |
|---|---|
| Min. | 1874.000 |
| 1st Qu. | 1979.000 |
| Median | 2001.000 |
| Mean | 1992.225 |
| 3rd Qu. | 2010.000 |
| Max. | 2020.000 |
| NA’s | 724.000 |
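
The release year summarized above has to be derived from the release date. The sketch below assumes `release_date` is stored as text in `YYYY-MM-DD` form, which the report does not confirm; the dates themselves are invented.

```r
# Hypothetical release dates in "YYYY-MM-DD" form; NA marks a missing date.
release_date <- c("1995-10-30", "2010-07-16", NA, "1874-12-09", "2010-12-17")

# The first four characters hold the year; missing dates stay NA.
year <- as.integer(substr(release_date, 1, 4))

# Five-number summary plus the NA count, as in the table above.
print(summary(year))

# Number of movies per year, the quantity plotted over time.
movies_per_year <- table(year)
print(movies_per_year)
```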
We have two populations of user movie ratings: “avg_rating” is the average rating given by MovieLens users, and “vote_average” is the average vote given by TMDb users. Two extreme outliers have an average rating of 0 despite vote counts greater than 30. For simplicity, and to make the scatterplots easier to read, we exclude these two points.
| Vote Average for TMDb Users | Value |
|---|---|
| Min. | 0.00000 |
| 1st Qu. | 5.00000 |
| Median | 6.00000 |
| Mean | 5.65042 |
| 3rd Qu. | 6.80000 |
| Max. | 10.00000 |
| NA’s | 642.00000 |
| Vote Average for MovieLens Users | Value |
|---|---|
| Min. | 0.500000 |
| 1st Qu. | 2.687500 |
| Median | 3.161670 |
| Mean | 3.060668 |
| 3rd Qu. | 3.500000 |
| Max. | 5.000000 |
| Vote Count | Value |
|---|---|
| Min. | 0.0000 |
| 1st Qu. | 3.0000 |
| Median | 10.0000 |
| Mean | 111.5553 |
| 3rd Qu. | 35.0000 |
| Max. | 14075.0000 |
| NA’s | 642.0000 |
Correlation Matrix
Looking at the correlation matrix, none of these variables has a strong linear relationship with average movie ratings. The largest correlation involving a rating is the moderate 0.48 between the two rating measures themselves (vote_average and avg_rating). There may still be non-linear relationships.
| | id | budget | popularity | revenue | runtime | vote_average | vote_count | avg_rating | year |
|---|---|---|---|---|---|---|---|---|---|
| id | 1.00 | -0.09 | -0.06 | -0.06 | -0.09 | 0.00 | -0.05 | -0.05 | 0.32 |
| budget | -0.09 | 1.00 | 0.45 | 0.77 | 0.13 | 0.04 | 0.68 | 0.03 | 0.13 |
| popularity | -0.06 | 0.45 | 1.00 | 0.51 | 0.12 | 0.10 | 0.56 | 0.08 | 0.13 |
| revenue | -0.06 | 0.77 | 0.51 | 1.00 | 0.10 | 0.08 | 0.81 | 0.06 | 0.09 |
| runtime | -0.09 | 0.13 | 0.12 | 0.10 | 1.00 | 0.11 | 0.11 | 0.12 | 0.08 |
| vote_average | 0.00 | 0.04 | 0.10 | 0.08 | 0.11 | 1.00 | 0.12 | 0.48 | -0.04 |
| vote_count | -0.05 | 0.68 | 0.56 | 0.81 | 0.11 | 0.12 | 1.00 | 0.11 | 0.11 |
| avg_rating | -0.05 | 0.03 | 0.08 | 0.06 | 0.12 | 0.48 | 0.11 | 1.00 | -0.04 |
| year | 0.32 | 0.13 | 0.13 | 0.09 | 0.08 | -0.04 | 0.11 | -0.04 | 1.00 |
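
A correlation matrix of this kind can be sketched with base R's `cor()`. Because several columns contain missing values, the example uses `use = "pairwise.complete.obs"`; how the report actually handled missing values is not stated, so that choice is an assumption, and the data are invented.

```r
# Toy numeric columns with a few missing values (all numbers are made up).
movies <- data.frame(
  budget  = c(10, 20, NA, 40, 50),
  revenue = c(25, 70, 90, NA, 160),
  runtime = c(95, 100, 110, 120, 90)
)

# Pearson correlations; each pair of columns uses the rows where both
# values are present. Rounded to two decimals like the table above.
corr <- round(cor(movies, use = "pairwise.complete.obs"), 2)
print(corr)
```

Pearson correlation only measures linear association, so a scatterplot is still needed to check for non-linear patterns.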
Density plots for popularity, vote count, runtime and average vote.
We assume that a reliable estimate of a movie’s true average rating requires at least 30 votes, and we find 12,439 movies that meet this criterion. A potential source of bias with this approach is that movies with very low vote totals may simply be poor, unpopular movies, which would explain both their low vote counts and possibly their low average ratings.
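
The 30-vote threshold described above amounts to a simple row filter. This is a sketch on invented data; the column names follow the report, while `min_votes` and `rated` are names introduced here for illustration.

```r
# Toy data: vote counts and average votes for five hypothetical movies.
movies <- data.frame(
  title        = c("A", "B", "C", "D", "E"),
  vote_count   = c(3, 30, 250, NA, 12),
  vote_average = c(8.0, 6.1, 7.4, NA, 2.5)
)

# Minimum number of votes required for the average to be considered usable.
min_votes <- 30

# Keep rows with a known vote count at or above the threshold.
rated <- movies[!is.na(movies$vote_count) & movies$vote_count >= min_votes, ]
print(nrow(rated))
```

Comparing the rating distribution of the excluded rows with that of `rated` is one way to gauge the selection bias mentioned above.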
The extreme values of vote count make the actual trend hard to see.
Even after removing the extreme values of vote count, it’s still hard to see much of a relationship between vote count and vote average.
It seems there is a slight negative relationship between average vote and when the movie was released.
There are 69 movies with runtime equal to zero.
| Runtime | Value |
|---|---|
| Min. | 0.0000 |
| 1st Qu. | 91.0000 |
| Median | 101.0000 |
| Mean | 103.7956 |
| 3rd Qu. | 114.0000 |
| Max. | 877.0000 |
| NA’s | 4.0000 |
## # A tibble: 1 x 1
## n
## <int>
## 1 69
The extreme values of runtime make any association hard to see. However, even after removing the outliers, it’s still hard to see much of an association between runtime and average movie rating.
It is not clear how the popularity score is measured.
| Popularity | Value |
|---|---|
| Min. | 0.002538 |
| 1st Qu. | 4.002335 |
| Median | 6.505146 |
| Mean | 7.779560 |
| 3rd Qu. | 9.698631 |
| Max. | 547.488298 |
Even with the removal of extreme outliers, there seems to be no relationship between movie popularity and average movie rating.
The top three most common words are woman, relationship and independent.
| Word | Frequency |
|---|---|
| woman | 3621 |
| relationship | 2094 |
| independ | 1952 |
| base | 1872 |
| murder | 1868 |
| love | 1543 |
| music | 1473 |
| war | 1451 |
| nuditi | 1428 |
| sex | 1074 |
Breakdown of movies by cast gender:
Here we have two different views of the association.
## # A tibble: 93 x 2
## year Mean
## <int> <dbl>
## 1 1915 11000000
## 2 1921 2500000
## 3 1923 623
## 4 1924 1213880
## 5 1925 1272550
## 6 1927 325272.
## 7 1930 7940
## 8 1931 4343790
## 9 1932 1597000
## 10 1933 6140500
## # … with 83 more rows
Average budget, revenue, and profit:
| Measure | Average (millions) |
|---|---|
| Budget | 34.14421 |
| Revenue | 99.88446 |
| Profit | 65.74025 |
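
Profit here is revenue minus budget, with all three expressed in millions. The sketch below applies that definition to the budget and revenue figures this report quotes for Avatar and Pirates of the Caribbean: On Stranger Tides; storing the raw amounts in single currency units and dividing by one million is an assumption about how the `M` columns were built.

```r
# Budget and revenue for the two films named in this report, written out in
# single currency units (assumed raw form of the data).
films <- data.frame(
  title   = c("Avatar", "Pirates of the Caribbean: On Stranger Tides"),
  budget  = c(237e6, 380e6),
  revenue = c(2788e6, 1046e6)
)

# Rescale to millions, then take the difference.
films$budgetM  <- films$budget / 1e6
films$revenueM <- films$revenue / 1e6
films$profitM  <- films$revenueM - films$budgetM

print(films[, c("title", "budgetM", "revenueM", "profitM")])
```

Since the mean is linear, average profit equals average revenue minus average budget, which matches the table: 99.88446 - 34.14421 = 65.74025.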
Correlations
| | popularity | budgetM | revenueM | profitM | runtime |
|---|---|---|---|---|---|
| popularity | 1.0000000 | 0.2903461 | 0.4301078 | 0.4274831 | 0.0749828 |
| budgetM | 0.2903461 | 1.0000000 | 0.7218920 | 0.5724324 | 0.1661427 |
| revenueM | 0.4301078 | 0.7218920 | 1.0000000 | 0.9806458 | 0.1732649 |
| profitM | 0.4274831 | 0.5724324 | 0.9806458 | 1.0000000 | 0.1582931 |
| runtime | 0.0749828 | 0.1661427 | 0.1732649 | 0.1582931 | 1.0000000 |
Two main outliers:
| title | budgetM | revenueM |
|---|---|---|
| Pirates of the Caribbean: On Stranger Tides | 380 | 1046 |
| Avatar | 237 | 2788 |
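The fitted line is Revenue = -3.34 + 3.02 × Budget (both in millions): each additional million of budget is associated with about 3.02 million more in revenue, and budget alone explains about 52% of the variance in revenue. In short, you need to spend money to make money.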
##
## Call:
## lm(formula = revenueM ~ budgetM, data = metascrubdf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -678.58 -43.24 -6.87 18.09 2074.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.34092 2.22597 -1.501 0.133
## budgetM 3.02322 0.04164 72.612 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 119.3 on 4845 degrees of freedom
## Multiple R-squared: 0.5211, Adjusted R-squared: 0.521
## F-statistic: 5273 on 1 and 4845 DF, p-value: < 2.2e-16
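
To make the fitted line concrete, the sketch below plugs a budget into the coefficients reported above and then shows the same kind of `lm()` call on a small invented data set. The helper `predict_revenue` and the data frame `toy` are hypothetical names, not part of the original analysis.

```r
# Coefficients copied from the regression output above.
reported <- c(intercept = -3.34092, slope = 3.02322)

# Predicted revenue (in millions) for a given budget (in millions).
predict_revenue <- function(budgetM) {
  reported[["intercept"]] + reported[["slope"]] * budgetM
}

# A 100-million budget corresponds to roughly 299 million in revenue.
print(predict_revenue(100))

# The same model form fitted to made-up data, to show the call itself.
toy <- data.frame(
  budgetM  = c(5, 20, 50, 100, 150, 200),
  revenueM = c(10, 70, 140, 310, 430, 610)
)
fit <- lm(revenueM ~ budgetM, data = toy)
print(coef(fit))
```

With residuals ranging from about -679 to 2075 in the real fit, individual predictions carry wide uncertainty even though the slope is estimated precisely.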
The correlation between popularity and profit is 0.43, so popularity is only moderately correlated with profit.
## [1] 0.4274831
Outliers: Minions (popularity) and Avatar (profit).
Action, Adventure, Comedy, and Drama top the list.
| Main Genre | Profit |
|---|---|
| Action | 74585 |
| Adventure | 56355 |
| Comedy | 45314 |
| Drama | 41911 |
| Animation | 25554 |
| Science Fiction | 12176 |
| Fantasy | 12056 |
| Horror | 11471 |
| Family | 9973 |
| Thriller | 9206 |
| Crime | 7377 |
| Romance | 5980 |
| Mystery | 2317 |
| History | 1292 |
| Music | 882 |
| War | 756 |
| Western | 708 |
| Documentary | 528 |
| TV Movie | 37 |
| Foreign | 17 |
Profit and profit variability both increase with budget. Movies in the top 10% by budget have the highest profit but also the most variability.
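
One way to examine profit by budget level is to split movies into budget deciles and compare the mean and standard deviation of profit in each group. This is a sketch under the assumption that decile bins were used; the report does not say how the budget groups were formed, and the numbers below are invented.

```r
# Made-up budgets (millions) and profits (millions) for 20 movies.
budgetM <- 1:20
profitM <- budgetM * c(1, 3)  # multipliers alternate 1, 3, 1, 3, ...

# Decile boundaries of budget, then assign each movie to a bin 1..10.
breaks <- quantile(budgetM, probs = seq(0, 1, by = 0.1))
decile <- cut(budgetM, breaks = breaks, include.lowest = TRUE, labels = 1:10)

# Mean and spread of profit within each budget decile.
mean_profit <- tapply(profitM, decile, mean)
sd_profit   <- tapply(profitM, decile, sd)
print(rbind(mean = mean_profit, sd = sd_profit))
```

On real data the decile boundaries can coincide when many movies share the same budget, in which case `cut()` fails on duplicated breaks and the bins need to be defined differently.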
The mean runtime is 111 minutes, with many outliers above the upper quartile.
## [1] 111.0433
As noted above, popularity tends to rise with profit.
Here are the top ten most profitable collections:
| Collection | Profit |
|---|---|
| Star Wars Collection | 6579 |
| Harry Potter Collection | 6427 |
| James Bond Collection | 5566 |
| The Fast and the Furious Collection | 4115 |
| Transformers Collection | 3401 |
| Despicable Me Collection | 3393 |
| Pirates of the Caribbean Collection | 3272 |
| The Twilight Collection | 2957 |
| Ice Age Collection | 2788 |
| Jurassic Park Collection | 2653 |
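
A ranking like the one above can be produced by grouping movies by collection and adding up their profits. The sketch assumes the table reports total (summed) profit per collection, which the report does not state; the collection names and amounts are invented.

```r
# Toy data: collection membership and profit (millions) per movie.
# NA marks a movie that does not belong to any collection.
movies <- data.frame(
  collection = c("A", "A", "B", NA, "C", "B"),
  profitM    = c(100, 250, 80, 500, 40, 30)
)

# Sum profit within each collection; rows with a missing collection are
# dropped by aggregate()'s default na.action.
by_collection <- aggregate(profitM ~ collection, data = movies, FUN = sum)

# Sort from most to least profitable and keep at most ten rows.
top <- head(by_collection[order(-by_collection$profitM), ], 10)
print(top)
```

Swapping `sum` for `mean` would rank collections by average profit per film instead, which penalizes long-running series less for weaker entries.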